minimaxm3-fp8-mi355x-vllm-disagg#1762
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27515117946 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27515119215 |
First sweep failure: diagnosed & fixed
The first disagg sweep (run 27515119215) failed; this was not a recipe bug. The day-zero
Fix:
Scoped to the vllm-disagg branch; pre-staged models (M2.5/Kimi) never reach this path. Re-running the sweep. |
8118fa3 to
a4f66bd
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27519206250 |
a4f66bd to
409561f
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27520697241 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27521167091 |
…tness) The conc-1 1k1k smoke test never triggered an eval — the multi-node eval policy only marks 8k1k entries with conc >= MIN_EVAL_CONC (16). Add an 8k1k conc-16 row (same 1P TP8 + 1D TP8 layout) so mark_eval_entries marks it run-eval=true (eval-conc=16), running lm-eval through the MoRI-IO disagg pipeline to validate correctness. The conc-1 1k1k row stays the latency smoke test. Run with non-canary-full-sweep-enabled so the (non-min-conc) eval entry runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
7b33cf1 to
01ed5b8
Compare
Widen the 1k1k disagg latency/throughput sweep from conc 1 to conc 1,2,4,8,16 (1P TP8 + 1D TP8). The 8k1k conc-16 eval row is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Widen the disagg sweep from conc 1 to conc 1,2,4,8,16 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27525928087 |
vllm/vllm-router only retains ~16 recent nightlies on Docker Hub; older dated tags are garbage-collected (manifest unknown), which makes `docker run` fail with exit 125 on any node that has not already cached the image.
MiniMax-M3 (MiniMaxM3SparseForCausalLM) is a hybrid sparse-attention model: sparse layers register a separate lightning-indexer cache (MLAAttentionSpec, rank-3, bf16, key-only) alongside the main cache (FullAttentionSpec, rank-5, fp8, K+V). The MoRIIO connector assumes one uniform KV layout -- it derives block geometry from the first cache and reuses first_layer's offsets for every layer (see its own "hybrid attn" TODO) -- so the bf16 key-only index cache is transferred with fp8 K+V sizing and gets corrupted on the decode worker, producing garbage output (disagg gsm8k ~= 0 while single-node M3 is correct). This is the vLLM analogue of the SGLang MoRI DSA-state bug in patches/mori_conn.py.
- patches/moriio_heterogeneous_kv.py: compute the READ-path transfer geometry per layer (own shape/stride/dtype/rank) instead of from the first cache. Idempotent; no-op for homogeneous models.
- setup_deps.sh: apply it on the vllm-disagg path.
NOTE: partial fix -- necessary but not yet sufficient. The index cache is also a separate KV-cache group whose block-table/num_blocks the single-namespace MoRIIO connector cannot map, so M3 disagg accuracy is still broken pending a larger multi-group / index-state transfer change. (Disabling sparse attention is not a viable workaround: M3's fused QKV carries index_k weights, so dropping the indexer breaks weight load.)
Refs #1762
Co-authored-by: Cursor <cursoragent@cursor.com>
…max-m3 image
The vLLM MoRIIOConnector in vllm/vllm-openai-rocm:minimax-m3 assumes the FlashAttention KV layout [2, num_blocks, ...] (K/V axis outer) but this vLLM's backends allocate [num_blocks, 2, ...] (K/V axis inner), so every disagg block transfer reads the wrong region. Invisible to throughput, but corrupts GQA/non-MLA accuracy (MiniMax-M3 gsm8k 0.0008 -> 0.957).
Instead of baking a fix into a rebuilt image (-hetkv) or carrying full vendored copies of the patched files in-tree, carry just the 218-line unified diff (patches/moriio/moriio-kv-layout-fix.diff) and apply it with `patch -p1` against the vLLM package dir inside the container at startup, ahead of the server launch. The repo is already bind-mounted into the container, so no EXTRA_DOCKER_MOUNTS wiring is needed -- job.slurm auto-applies the diff when DOCKER_IMAGE_NAME contains "minimax-m3" (skippable with MORIIO_KV_PATCH=skip), mirroring the existing mori_conn.py sglang hook. A failed apply aborts the container instead of silently running unpatched.
Validated on a manual 2-node run (n06-21 prefill+router / n09-21 decode) using the STOCK image: gsm8k strict-match 0.9568 / flexible-extract 0.9560 (matches the baked image within noise), decode probe healthy.
- patches/moriio/moriio-kv-layout-fix.diff: unified diff vs stock
- job.slurm: in-container `patch` step, MORIIO_KV_PATCH=skip opt-out
- patches/README.md: document the moriio/ diff-apply mechanism
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
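To illustrate why the axis order matters, here is a minimal sketch of where one block's K and V regions start under each layout, assuming a contiguous tensor. It is not the connector's code: `kv_block_offsets`, the layout labels, and `block_elems` are hypothetical names introduced for this example.

```python
def kv_block_offsets(layout, num_blocks, block_elems, block_id):
    """Return (k_offset, v_offset), in elements, for one KV block.

    layout "kv_outer": tensor shaped [2, num_blocks, ...] (K/V axis outer)
    layout "kv_inner": tensor shaped [num_blocks, 2, ...] (K/V axis inner)
    block_elems: elements in a single K (or V) block (hypothetical parameter)
    """
    if not 0 <= block_id < num_blocks:
        raise IndexError(f"block_id {block_id} out of range")
    if layout == "kv_outer":
        # All K blocks come first, then all V blocks.
        k_off = block_id * block_elems
        v_off = (num_blocks + block_id) * block_elems
    elif layout == "kv_inner":
        # K and V of the same block sit next to each other.
        k_off = (2 * block_id) * block_elems
        v_off = (2 * block_id + 1) * block_elems
    else:
        raise ValueError(f"unknown layout: {layout}")
    return k_off, v_off
```

For any block other than block 0, the two layouts give different offsets, so a reader that assumes the outer layout against an inner-layout cache copies the wrong region while still moving the same number of bytes. That is consistent with the error being invisible to throughput.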
… 8k1k Widen the disagg sweep from conc 1,2,4,8,16 to 1,2,4,8,16,32,64,128,256,512,1024 for both seq-len scenarios (1P TP8 + 1D TP8). The 8k1k conc-16 point keeps the multi-node eval marked (eval-conc=16) so lm-eval still validates the MoRI-IO disagg pipeline. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…and 8k1k
Add two asymmetric prefill/decode layouts alongside the existing TP8+TP8 sweep, for both seq-len scenarios:
- 1P TP4 + 1D TP8 (smaller prefill, full-node decode) at conc 1..256
- 1P TP4 + 1D TP4 (balanced half-node) at conc 64..1024
Per-worker TP is driven by the master-config prefill/decode tp: server_vllm.sh sed-rewrites the models_vllm.yaml --tensor-parallel-size 8 placeholder to the computed PREFILL_TP_SIZE/DECODE_TP_SIZE, so no models_vllm.yaml flag change is needed (comment updated to say so). The multinode eval policy still marks exactly one lm-eval (groups by dp-attn, not TP) on the TP8+TP8 8k1k layout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
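The placeholder rewrite can be pictured with a short Python analogue. This is only a sketch: the real rewrite is a sed call in server_vllm.sh whose exact expression is not shown in this thread, and `rewrite_tp` is a hypothetical helper.

```python
import re


def rewrite_tp(serve_args, tp_size):
    """Replace the '--tensor-parallel-size 8' placeholder with tp_size.

    Raises if the placeholder is missing, so a recipe that lacks it fails
    loudly rather than launching with an unintended TP size.
    """
    pattern = r"--tensor-parallel-size\s+8\b"
    new_args, count = re.subn(pattern, f"--tensor-parallel-size {tp_size}", serve_args)
    if count == 0:
        raise ValueError("placeholder '--tensor-parallel-size 8' not found")
    return new_args
```

With this shape, one recipe line serves both the prefill and the decode worker: each worker calls the rewrite with its own computed TP size.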
…d MoRIIO diff
Replaces moriio-kv-layout-fix.diff with moriio-minimax-m3-disagg.diff, which bundles three layered fixes for the stock minimax-m3 vLLM image:
1. KV-layout: axis-aware per-layer block offsets (the gsm8k 0.0008→0.958 fix, required for homogeneous TP too).
2. heterogeneous-TP addressing + guard: maps each decode rank to the correct prefill rank (tp_rank // ratio) for PREFILL_TP_SIZE != DECODE_TP_SIZE, and raises NotImplementedError for unsupported cases (prefill-TP > decode-TP, KV-head splitting) instead of silently corrupting KV.
3. dup-ack fan-in: with DECODE_TP_SIZE > PREFILL_TP_SIZE, producer counts ACKs per transfer_id and only frees KV blocks once all expected consumers ACK, preventing both the late-ACK EngineCore crash and KV reuse before slower decode ranks finish reading.
job.slurm and patches/README.md updated to reference the new diff name.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
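The fan-in idea in fix 3 can be sketched as a small counter keyed by transfer_id. This is an illustration only, not the patched connector: the class `AckFanIn`, its method names, and the use of consumer ranks to de-duplicate ACKs are assumptions made for the example.

```python
from collections import defaultdict


class AckFanIn:
    """Track ACKs per transfer and report when KV blocks may be freed."""

    def __init__(self, expected_acks):
        self.expected_acks = expected_acks  # consumers reading each transfer
        self.seen = defaultdict(set)        # transfer_id -> ranks that ACKed
        self.done = set()                   # transfers already released

    def on_ack(self, transfer_id, consumer_rank):
        """Return True exactly once: when the last expected consumer ACKs."""
        if transfer_id in self.done:
            # Late or duplicate ACK after release: ignore instead of crashing.
            return False
        self.seen[transfer_id].add(consumer_rank)
        if len(self.seen[transfer_id]) >= self.expected_acks:
            del self.seen[transfer_id]
            self.done.add(transfer_id)
            return True
        return False
```

A caller would free the producer's KV blocks only when `on_ack` returns True. A repeated ACK from the same rank does not advance the count, and an ACK that arrives after release is dropped.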
With P8/D4 and 4 KV heads, vLLM distributes heads across prefill ranks in consecutive pairs: (rank0,rank1)→head0, (rank2,rank3)→head1, etc. The previous patch used `return self.tp_rank` for the P>D branch, which made decode rank 1 connect to prefill rank 1 (holds head0) instead of prefill rank 2 (holds head1) — corrupting KV for all decode ranks except 0. Fix: use `self.tp_rank * ratio` (ratio = remote_tp_size // local_tp_size), the symmetric counterpart to the D>P case's `tp_rank // ratio`. This maps each decode rank to the *first* prefill rank of its head group, which holds the correct KV content via vLLM's replication scheme. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
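The two mapping rules named in this commit can be written side by side. This is a minimal sketch from the decode worker's point of view (local = decode, remote = prefill); `remote_tp_rank` is a hypothetical free function, not the connector's `_remote_tp_rank` method, and the divisibility check is an added assumption.

```python
def remote_tp_rank(local_rank, local_tp, remote_tp):
    """Map a decode (local) TP rank to the prefill (remote) rank to read from."""
    if local_tp == remote_tp:
        return local_rank
    big, small = max(local_tp, remote_tp), min(local_tp, remote_tp)
    if big % small != 0:
        raise ValueError("TP sizes must divide evenly")
    ratio = big // small
    if local_tp > remote_tp:
        # D > P: several decode ranks share one prefill rank.
        return local_rank // ratio
    # P > D: pick the first prefill rank of this decode rank's head group.
    return local_rank * ratio
```

For P8/D4 this yields prefill ranks 0, 2, 4, 6 for decode ranks 0 to 3, matching the pairing described above. Returning `local_rank` unchanged would instead send decode rank 1 to prefill rank 1, which holds head0.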
…ansion The P>D fix added 4 lines to _remote_tp_rank but the hunk header still said +1100,40; patch aborted with "malformed patch at line 79". Update to +1100,44 to match the actual 6 context + 38 added lines. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
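The line counts in a unified-diff hunk header can be recomputed from the hunk body, which helps catch this kind of mismatch before `patch` does. A minimal sketch follows; `hunk_counts` is a hypothetical helper, and treating an empty line as context is a leniency chosen for this example.

```python
def hunk_counts(body_lines):
    """Return (old_count, new_count) for the lines of one hunk body.

    Context lines count toward both sides, '-' lines toward the old side
    only, and '+' lines toward the new side only.
    """
    old = new = 0
    for line in body_lines:
        if line.startswith("\\"):
            # Marker such as "\ No newline at end of file": not a content line.
            continue
        tag = line[:1]
        if tag == "+":
            new += 1
        elif tag == "-":
            old += 1
        else:
            old += 1
            new += 1
    return old, new
```

For a hunk with 6 context lines and 38 added lines, the new-side count is 44, which is the value the header was corrected to.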
The MoRIIO KV-layout patch was injected into the per-node container launch
via '"${_MORIIO_PATCH_CMD:-}"', which breaks out of the outer
srun bash -c "..." double-quoted string. Because the patch command value
contains spaces and the shell operators '<' and '||', the unquoted
expansion word-split the generated container script, truncating it right
after the word `patch` and silently dropping the patch arguments AND the
server.sh launch. The container then exited 0:0 within seconds, producing
no benchmark/eval output -> collect_latest_results found "No logs
directory" -> the launch step failed with exit 1 (all minimax-m3 disagg
jobs affected).
Fix: expand ${_MORIIO_PATCH_CMD:-} directly inside the inner bash -lc
single quotes (no quote toggling), so the patch command stays intact and
its operators are parsed by the container shell. Validated end-to-end:
gsm8k recovers from ~0 (garbage) to 0.94-0.98 across P8D8/P4D8/P8D4.
Co-authored-by: Cursor <cursoragent@cursor.com>
…1k & 8k1k) Two TP4 prefill workers (num-worker 2, PREFILL_NODES=2, each TP4 on half an 8-GPU node) feeding one TP8 decode (DECODE_NODES=1) — 3 nodes total. Added to both seq-len scenarios at conc 256,512,768,1024. Eval marking unchanged (still one lm-eval on the 8k1k TP8+TP8 layout). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1677806 to
aad872a
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27916728634 |
The per-layer READ-offset fix this Python patcher applied to
moriio_connector.py is fully subsumed by the unified overlay
patches/moriio/moriio-minimax-m3-disagg.diff, which job.slurm applies
with `patch -p1` BEFORE server.sh sources setup_deps.sh. The diff
rewrites the exact lines the patcher searches for (the `first_layer`
single-offset block and the `is_mla = len(self.kv_cache_shape)` sizing),
with a stronger geometry-memoized + heterogeneous-TP-aware version, so
the patcher's OLD1/OLD2 patterns no longer match and it already no-ops
("pattern not found; skipping") in the real flow. It's also the same
fix now upstreamed in vLLM #46039 (READ mixed KV layouts).
Drop the dead patcher and its setup_deps.sh hook so the diff is the
single source of truth. patches/README.md only documents the diff (no
reference to this patcher), so no README change is needed.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27968834654 |
|
@functionstackx All three related PRs have been merged. PR 1: vllm-project/vllm#46039 |
- Co-work with Gupta, Ravi
All three MoRIIO fixes the in-tree overlay carried have merged upstream and now
ship in the ROCm nightly image:
- vLLM #46039 READ-mode mixed KV-layout (axis-aware per-layer offsets)
- vLLM #46290 WRITE-mode per-geometry offset caching
- vLLM #46332 heterogeneous-TP rank mapping + ACK fan-in
Point minimaxm3-fp8-mi355x-vllm-disagg at
vllm/vllm-openai-rocm:nightly-556bc4e3a089378e9df2482659898192da18db15
(vLLM 0.23.1rc1.dev363+g556bc4e3a, which contains all three merges) and remove
the stop-gap overlay:
- delete patches/moriio/moriio-minimax-m3-disagg.diff
- drop the job.slurm in-container auto-apply block (+ MORIIO_KV_PATCH gate)
- trim the moriio/ section from patches/README.md
Verified on the nightly image with NO patch across all four P/D layouts x
conc {1,4,8}, gsm8k strict/flexible 0.95-0.97 (1P8+1D8, 1P4+1D8, 1P4+1D4,
2P4+1D8) -- matching the previously-patched results.
Refs #1762.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28101174324 |
|
/reuse-sweep-run |
The minimaxm3-fp8-mi355x-vllm-disagg entry was inserted mid-file (after the #1862 entry), which violates the append-only changelog gate ("entry 511 changed; existing entries are immutable"). Move it to the end of perf-changelog.yaml so existing entries stay byte-identical to main and the new entry is a clean append. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
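An append-only gate of this kind reduces to a prefix comparison between the base branch's entries and the PR's entries. The sketch below is an assumption about the general technique, not the code of validate_perf_changelog.py; `check_append_only` and its 0-based entry index are hypothetical.

```python
def check_append_only(base_entries, new_entries):
    """Return the appended entries, or raise if an existing entry changed.

    base_entries: changelog entries on the base branch, in file order
    new_entries:  changelog entries on the PR branch, in file order
    """
    if len(new_entries) < len(base_entries):
        raise ValueError("entries were removed")
    for index, (old, new) in enumerate(zip(base_entries, new_entries)):
        if old != new:
            raise ValueError(f"entry {index} changed; existing entries are immutable")
    return new_entries[len(base_entries):]
```

Inserting a new entry mid-file shifts every later entry by one position, so the first shifted position is reported as changed even though no existing entry was edited. Moving the new entry to the end restores the common prefix.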
|
/reuse-sweep-run |
What
MiniMax-M3 MXFP8 MI355X vLLM disaggregated (prefill/decode) benchmark on the day-zero ROCm image (`vllm/vllm-openai-rocm:minimax-m3`):
- … (`num-worker 2`, `PREFILL_NODES=2`, 3 nodes total), conc 256,512,768,1024
- … `prefill/decode.tp`: `server_vllm.sh` sed-rewrites the `--tensor-parallel-size 8` placeholder in `models_vllm.yaml` to the computed `PREFILL_TP_SIZE`/`DECODE_TP_SIZE` (TP4 uses half an 8-GPU node; node counts are set via `PREFILL_NODES`/`DECODE_NODES`: the 1P layouts use 2 nodes, the 2P TP4 layout uses 3)
- … `minimaxm3-fp8-mi355x-vllm-disagg`
Upstream MoRI-IO fixes: all three vLLM PRs merged
This PR runs inter-node disaggregation: prefill node(s) plus a decode node, with KV transferred across nodes over MoRI-IO. Its correctness (the 8k1k gsm8k eval) depends on MoRIIO fixes that were originally carried here as a runtime overlay against the day-zero `minimax-m3` image. Per the upstream plan (tanpinsiang, 2026-06-20), the work was split into three staged vLLM PRs, all staged from this PR (#1762). All three required upstream PRs are now merged:
1. READ-mode mixed KV layouts (MERGED): vLLM #46039 "[ROCm][P/D] Support MiniMax-M3 mixed KV layouts in MoRIIO READ mode" (junkang1991, AMD; tracks vLLM issue #45885). The connector reused the first layer's offsets and assumed a single KV layout, but M3 registers three per-layer formats: separated `[2, num_blocks, …]`, ROCm-interleaved `[num_blocks, 2, …]`, and the rank-3 key-only indexer `[num_blocks, block_size, head_dim]`. Transfers therefore read the wrong region (invisible to throughput; gsm8k `0.0008` token salad). The fix makes READ offsets per-layer / layout-aware via `KVCacheSpec`. Merged 2026-06-21; validated intra-node 1P1D TP4+TP4 GSM8K ≈ 0.955.
2. WRITE per-geometry offset caching (MERGED): vLLM #46290 "[ROCm][P/D] Fix MoRIIO WRITE mode for mixed KV layouts" (tanpinsiang). Scope: `MoRIIOWriter._prepare_transfer_plan` caches WRITE offsets per KV-cache geometry instead of one request-wide offset tuple. This is the WRITE half that this PR's `moriio_engine.py` overlay already carries. Merged 2026-06-23.
3. Heterogeneous-TP rank mapping + ACK fan-in (MERGED): vLLM #46332 "[ROCm][P/D] Support MoRIIO heterogeneous TP fan-in" (tanpinsiang). Scope: remote TP rank mapping, READ notification target, plain ACK parsing, fan-in ACK counting, duplicate-ACK handling. This is what makes prefill-TP ≠ decode-TP across nodes work (the het-TP / dup-ACK fixes this PR's overlay carries). Merged 2026-06-23.
Our stop-gap overlay bundles all three fixes so we can reuse the stock `minimax-m3` image today: `benchmarks/multi_node/amd_utils/patches/moriio/` (`moriio_connector.py` READ + `moriio_engine.py` WRITE + `moriio_common.py` per-geometry cache) + `patches/moriio_heterogeneous_kv.py`, auto-mounted by `job.slurm` when `DOCKER_IMAGE_NAME` contains `minimax-m3` (`MORIIO_KV_PATCH=skip` to disable). Inter-node disagg gsm8k = strict-match 0.9583 / flexible-extract 0.9575, matching single-node. See `patches/README.md`.
Next unblock step: pick up a published `minimax-m3` image that contains #46039, #46290, and #46332; once that image is available and validated, the `patches/moriio/` overlay + `job.slurm` auto-mount can be dropped.
Layered on #1585 (remove vLLM-disagg MoRI patches)
This PR brings in #1585's MoRI-patch-removal infra (that PR is very stale vs `main`, so the changes are applied selectively rather than by merge):
- `amd_utils/{setup_deps.sh, server_vllm.sh, submit.sh, models_vllm.yaml}`: taken from #1585 ("[Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks"). `main` is untouched here since the merge-base, so these equal `main` + the mori removal. Includes `--all2all-backend mori` → `mori_low_latency` for the existing M2.5/Kimi entries.
- `amd_utils/job.slurm`: #1585's two vLLM-disagg hunks applied onto current `main` (keeping `main`'s atom-disagg support): vllm-router image `nightly-20260511-e667ebb` → `nightly-20260603-e667ebb`, and drop the `VLLM_MORIIO_CONNECTOR_READ_MODE` env from the `vllm-disagg` container block.
M3 recipe
- `benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh`: model-agnostic disagg boilerplate (byte-identical to the M2.5 disagg script; the launcher resolves the per-SKU script by name).
- `models_vllm.yaml` `MiniMax-M3-MXFP8`: per-worker serve flags: `--block-size 128` (MSA sparse/index cache), `--language-model-only` (text-only benchmark), `--kv-cache-dtype fp8` (gfx950), `--attention-backend TRITON_ATTN`, `minimax_m3` tool/reasoning parsers; no EP (MoE experts TP-sharded as in the single-node M3 recipe). The `--tensor-parallel-size 8` is a placeholder rewritten per-worker at launch. Env: `VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_USE_BREAKABLE_CUDAGRAPH=0 VLLM_ENGINE_READY_TIMEOUT_S=3600`.
Scope guard
`perf-changelog.yaml` and `.github/configs/amd-master.yaml` contain only M3 changes vs `main`.
Validation
- `validate_perf_changelog.py` append-only gate → 1 appended entry, 0 pr-link corrections ✓
- `generate_sweep_configs test-config` → 6 disagg configs (3 layouts × {1k1k, 8k1k}); exactly 1 `run-eval=true`, on 8k1k TP8+TP8 with `eval-conc 128`; all 1k1k entries `run-eval=false` ✓
- `minimaxm3 / fp8 / vllm-disagg` → `benchmarks/multi_node/minimaxm3_fp8_mi355x_vllm-disagg.sh` ✓
- `process_changelog.py` selects `minimaxm3-fp8-mi355x-vllm-disagg` ✓
🤖 Generated with Claude Code
Note
Medium Risk
Touches disaggregated KV transfer and runtime patching of vLLM inside containers—incorrect offsets or ACK handling would corrupt accuracy or crash engines; benchmark-only scope limits production blast radius.
Overview
Adds `minimaxm3-fp8-mi355x-vllm-disagg` to `amd-master.yaml`: multi-node vLLM prefill/decode on `vllm/vllm-openai-rocm:minimax-m3`, sweeping 1k1k and 8k1k concurrency across four P/D layouts (1P TP8 + 1D TP8, 1P TP4 + 1D TP8, 1P TP4 + 1D TP4, 2P TP4 + 1D TP8), with 8k1k wired for one gsm8k eval on the TP8+TP8 layout.
MoRIIO correctness on the stock image: ships `patches/moriio/moriio-minimax-m3-disagg.diff` (KV layout, heterogeneous-TP rank mapping, dup-ack fan-in) and `job.slurm` auto-applies it inside the container for `minimax-m3` images before `server.sh` runs; a failed patch aborts the job. Documents the overlay in `patches/README.md`.
Serving / infra: new `models_vllm.yaml` `MiniMax-M3-MXFP8` recipe and launcher script `minimaxm3_fp8_mi355x_vllm-disagg.sh` (cluster HF cache path for the ~414GB checkpoint). `server_vllm.sh` sets MoRIIO `read_mode: true` in `kv_connector_extra_config` instead of `VLLM_MORIIO_CONNECTOR_READ_MODE`. `setup_deps.sh` drops large in-container Python MoRIIO/scheduler patches (relies on image + unified diff). Kimi/M2.5 disagg flags use `mori_low_latency`; vllm-router default tag bumped to `nightly-20260617-e667ebb`. `perf-changelog.yaml` entry added.
Reviewed by Cursor Bugbot for commit 33b3fd2. Bugbot is set up for automated code reviews on this repo.